In this report, we analyze data from the New York City Airbnb Open Data 2019 dataset.
Airbnb operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. The platform is accessible through its website or mobile app.
The data we are analyzing comes from: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
This dataset describes listing activity and metrics in New York City for 2019. Through it, we can get information regarding hosts and geographical availability, as well as the metrics needed to make predictions and draw conclusions.
We will perform our analysis using Python. In Section II, we clean the data and perform exploratory analysis. In Section III, we explore the relationship between the name of each listing and its popularity using Natural Language Processing methods.
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.express as px
import pandas as pd
import numpy as np
import seaborn as sns
import re
import more_itertools as more
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
df = pd.read_csv('Airbnb_NYC_2019.csv')
df #Our original Dataset
print(df.shape)
Here we find that our dataset contains 48,895 rows, each representing an individual listing in New York. We are also provided with 16 columns containing the factors needed for exploratory data analysis.
We can see these different columns below:
df.columns
Data Types of Each Column
The data type of each column can be seen below:
df.dtypes
Find outliers
Get rid of listings that are too expensive
Get rid of listings that do not have availability for 2019
Possibly find potential negative and NA values
The Airbnb calendar for a listing does not differentiate between a booked night vs an unavailable night, therefore these bookings have been counted as "unavailable". This serves to understate the Availability metric because popular listings will be "booked" rather than being "blacked out" by a host.
availability_365: number of days the property is available for rent
Our first line of action when analyzing this data is to properly clean the dataset. Having clean data will provide us with the highest quality of information needed and therefore will provide us with the most accurate predictions and correlations. Our process for cleaning the dataset can be seen below:
We will first check the number of rows with missing values. We will then remove such rows depending on the missing variables.
#Checking for null values
df.isnull().sum()
As we can see, 4 columns have missing values. However, we've determined that these missing values do not greatly affect our NLP modelling, so we've decided to keep these listings.
On the other hand, while combing through our main dataframe, we discovered that there are a number of listings with 0 available days to be rented as well as 0 total reviews.
df[df.availability_365 == 0]
The dataframe above consists of listings with zero availability. We chose to remove these listings from the dataframe we will analyze, because zero availability means the listing is not available at all. This data would therefore be irrelevant, as we want to perform analysis on listings that have availability.
df[df.number_of_reviews == 0]
#We will remove entries with 0 days of availabilities and 0 reviews.
#We believe that these are faulty listings that were not completely removed from the AirBnB server.
#Most of these listings were not available under the main English website but were viewable in different languages such as Spanish.
The dataframe above consists of listings with zero reviews. We chose to remove these values from the data we want to analyze because the number of reviews tells us a lot about a listing's popularity and is an important component of the Natural Language Processing method we will perform. Listings with zero reviews would therefore be irrelevant to our type of analysis.
Due to the drawbacks discussed regarding 0 days of availabilities and 0 number of reviews, it would be best to simply remove these entries in the dataset we want to work with. Below, we properly remove these listings.
df = df[df.availability_365 != 0] # Remove entries with 0 days of availabilities
df = df[df.number_of_reviews != 0] # Remove entries with 0 number of reviews
df.shape
Above, we can see the dimensions of our reduced dataframe, excluding listings where availability or number of reviews is 0.
Now, we will examine the price range of all Airbnb listings by separating these listings into quartiles to get a general insight for this column alone. Along with the 25%, 50%, and 75% quartile calculations, we can also get the mean, standard deviation, and the minimum and maximum values of the price column.
quartiles = df[["price"]].describe()
quartiles
#From the initial quartile calculation, we see that there are listings with $0 as price.
#Inspecting listings with `price` = 0
As observed in our quartile table above, there are, interestingly, listings priced at $0. Our next course of action is to pull up these listings and do further research.
df.loc[lambda df: df.price == 0] # Possibly explain why some listings in our data would be $0
After multiple online searches of these listings, we've come up with mixed results regarding their robustness. Listings such as Kimberly's seem to have been removed, while all three of Adeyemi's listings were still up and running on Airbnb in 2021. Additionally, there is no way for us to revisit Airbnb's 2019 pages to verify these listings.
One possible reason for these $0 prices may be a deliberate act by the hosts to temporarily remove the listing from the Airbnb market at the time these listings were webscraped. The additional fact that two hosts account for 3 and 2 of these $0 listings respectively may further indicate a host-activated anomaly.
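The per-host pattern can be verified with a quick groupby. Below is a minimal sketch on a toy frame (the host_id values are illustrative; in the notebook the input would be `df.loc[df.price == 0]`):

```python
import pandas as pd

# Toy stand-in for the $0-price listings; host_id values are illustrative.
zero_price = pd.DataFrame({
    "host_id": [101, 101, 101, 202, 202, 303],
    "price": [0, 0, 0, 0, 0, 0],
})

# Count how many $0 listings each host owns. Repeated hosts suggest a
# host-side action (e.g. temporary delisting) rather than random noise.
per_host = zero_price.groupby("host_id").size().sort_values(ascending=False)
print(per_host)
```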
Either way, we will remove these listings just to minimize any possibility of accruing errors based on unknown anomalies.
Below is the process of removing entries with price = 0.
df = df[df.price != 0] # Remove listings that have $0 price.
quartiles = df[["price"]].describe() # See the statistics of price to further analyze
quartiles
Above, we can see our new summary statistics with $0 of price removed.
plt.boxplot(x = df['price'], vert = False)
plt.title('Boxplot of listing prices')
plt.xlabel('Price')
We can see that the data is quite skewed with multiple extremely high prices heavily affecting the data. Further research also shows that some of these data points are faulty listings, such as a 2012 Superbowl private room listing.
We will use the 1.5 Interquartile Range Test to determine outliers and store these listings in a different dataframe for further data analysis.
#Using the 1.5 Interquartile Range Test to determine outliers
IQR = np.percentile(df.price, 75)-np.percentile(df.price, 25)
lower_limit = np.percentile(df.price, 25)- 1.5*IQR
upper_limit = np.percentile(df.price, 75) + 1.5*IQR
print(lower_limit, upper_limit)
# Calculations for outliers within the price entry of data, used to deem what is considered "too expensive."
Here, through the IQR test, we can identify outliers within the price column. Anything above 332.5 dollars would be considered too expensive, and therefore an outlier. Since prices cannot technically be negative, we fix the lower limit at 10 dollars, the minimum value seen in our quartile table.
#Since there technically cannot be any negative values for prices, we will fix the lower limit at $10 which is the minimum value as seen in our quartile table.
outliersdf = df.loc[lambda df: df.price > upper_limit]
outliersdf
# Dataset of what is considered too expensive, conduct plots to see analysis.
# Possibly explain why these listings would be considered expensive.
outliersdf.shape
Below, we have cut all outliers and properly cleaned the dataset and can move on to exploratory data analysis. We can also see the dimensions of our cleaned up dataset.
tidydf = df.loc[lambda df: df.price <= upper_limit]
tidydf.shape
plt.figure(figsize = (12,12))
sns.heatmap(tidydf.corr(),annot=True) # Explain corr plot.
Here, we can see a correlation matrix between the variables in our dataset. In this plot, the lighter the color, the higher the correlation between two variables; conversely, the darker the color, the lower the correlation. We can immediately see a high correlation of 0.6 between id and host_id. Number of reviews and reviews per month also have a higher correlation than the others, at 0.49. Along with positive correlations, we can also see negative correlations between id and number of reviews (-0.45) and between id and reviews per month (-0.45). There is also a negative correlation between price and longitude (-0.3), though a slight positive correlation of 0.051 between price and latitude.
As id is a unique tag per listing and host_id a unique tag per host, the positive correlation between them likely reflects that both are assigned sequentially over time, so newer hosts (higher host_id) tend to create newer listings (higher id); hosts with multiple listings also contribute. Number of reviews and reviews per month have a positive correlation as well, since every review contributes to the monthly rate, giving the two variables a direct relationship.
The lowest correlations in this plot are between id and the two review variables. Since ids increase as new listings are created, this negative correlation may reflect that newer listings (higher ids) have had less time to accumulate reviews. Another interesting negative correlation is between price and longitude: since longitude runs west to east, price in New York does not appear to increase in that direction. Price and latitude, however, show a slight positive correlation, meaning price is related more to a listing's north-south position than its east-west position.
We will plot some graphs with our tidied dataset to gain further insight, as well as to visualize any more abnormal data points for further cleaning. We will not be subsetting these outliers from the main tidied dataset as these variables will not be our main focus for our NLP analysis. However, we will generate graphs of these outliers for further analysis.
sns.set(style="darkgrid")
sns.countplot(y = 'room_type', data= tidydf)
plt.title('Room Type Popularity')
plt.xlabel('Count')
plt.ylabel('Room Type')
We can see that most listings are either private rooms or the entire property. This indicates that most Airbnb customers may be looking for private accommodation rather than shared spaces.
sns.countplot(y = 'neighbourhood_group', data= tidydf).set_title("Listings by Borough")
plt.xlabel('Count')
plt.ylabel('Borough')
Most listings are in the boroughs of Brooklyn or Manhattan. This is logical given that these 2 boroughs contain the most attractions and activities in New York City. Staten Island has the fewest listings, which may be due to its relatively inaccessible geographical location compared to the other boroughs.
dfn = tidydf['neighbourhood'].value_counts()
sns.countplot(y='neighbourhood',data=tidydf, order=dfn.iloc[:10].index).set_title('Top 10 Neighborhoods of Listings')
plt.xlabel('Count')
plt.ylabel('Neighborhood')
The top 4 neighborhoods of AirBnB listings are all in Brooklyn, with Bedford-Stuyvesant being the top neighborhood by quite a margin. The 5th to 8th most popular neighborhoods are in Manhattan. This may be due to affordability of Brooklyn compared to Manhattan.
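The affordability reading can be sanity-checked by comparing median prices per borough. Below is a minimal sketch on a toy frame with illustrative prices; in the notebook this would be `tidydf.groupby('neighbourhood_group')['price'].median()`:

```python
import pandas as pd

# Toy listings frame; borough names mirror the neighbourhood_group column,
# prices are illustrative.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [200, 180, 90, 110, 75],
})

# Median price per borough; on the real data this supports the reading
# that Brooklyn is more affordable than Manhattan.
median_price = listings.groupby("neighbourhood_group")["price"].median()
print(median_price)
```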
Density Map of All Listings According to Price
https://geopandas.org/gallery/create_geopandas_from_pandas.html
!pip install contextily
import geopandas as gpd
import contextily as ctx
plt.figure(figsize=(15,30))
sns_map = sns.scatterplot(x='longitude', y='latitude', hue='price',s=20, data=tidydf)
ctx.add_basemap(sns_map, crs = 'EPSG:4326', source=ctx.providers.CartoDB.Positron)
sns_map.set_axis_off()
plt.title('Density Map of Airbnb Listings')
From the density map above, we can see that most of the listings are concentrated around the boroughs of Brooklyn and Manhattan. We can also observe that Manhattan has a higher density of expensive listings, and apart from standout anomalies, prices tend to be lower as listings become further from Manhattan.
ax = sns.violinplot(x="room_type", y="price", data=tidydf)
plt.title("Price Distribution According to Room Type")
plt.xlabel('Room Type')
plt.ylabel('Price')
The plot above shows the distribution of price by room type, with much variation within each type. Entire home/apt listings tend to be more expensive than both private and shared rooms, while private rooms are more expensive than shared rooms. The logic behind these prices seems to make sense.
px.histogram(tidydf, x = 'price', title = 'Price Distribution')
From the histogram above, we can see a right-skewed unimodal distribution. A majority of the listings are priced between 35 and 130 USD. However, there are significant spikes at the 150, 200, 250, and 300 USD bins, as well as smaller spikes at the 175, 225, 275, and 325 USD bins. We believe this may be caused by human psychology: hosts may round their prices to the nearest 25 or 50 dollars for a more "aesthetically pleasing" price.
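The rounding hypothesis can be quantified by measuring what share of prices fall on an exact multiple of 25. Below is a minimal sketch with illustrative prices; the notebook would use `tidydf['price']`:

```python
import pandas as pd

# Illustrative price column standing in for tidydf["price"].
prices = pd.Series([50, 75, 99, 100, 125, 137, 150, 200])

# Share of listings priced at an exact multiple of 25 -- a rough proxy
# for the "aesthetically pleasing" rounding behaviour described above.
rounded_share = (prices % 25 == 0).mean()
print(rounded_share)  # 6 of the 8 toy prices are multiples of 25
```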
#by room type and borough
sns.catplot(x='neighbourhood_group', y='price', data = tidydf, hue = 'room_type')
plt.title("Listing Prices Based on Boroughs and Room Types")
plt.xlabel("Borough")
plt.ylabel("Price")
There is an obvious trend that shared rooms are the cheapest option in all boroughs, followed by private rooms and entire properties. Staten Island and the Bronx have far fewer listings above 150 USD compared to Brooklyn and Manhattan. Additionally, as seen in our density map, Manhattan has the highest density of expensive listings.
plt.figure(figsize=(15,30))
sns_map_out = sns.scatterplot(x='longitude', y='latitude', hue='price',s=20, data=outliersdf)
ctx.add_basemap(sns_map_out, crs = 'EPSG:4326', source=ctx.providers.CartoDB.Positron)
sns_map_out.set_axis_off()
plt.title("Density Map of Outlier Airbnb Listings")
#learning about outlying prices according to room type
sns.catplot(x='room_type', y='price', data = outliersdf)
plt.title(" Catplot Price Distribution of Outliers According to Room Type")
plt.xlabel('Room Type')
plt.ylabel('Price')
Most of the outlying expensive listings are either entire properties or private rooms. Interestingly, the most expensive listing is a 10,000 USD private room in Queens. This listing is now defunct, but we can infer that it might have been a long-term rental situation, as the listing also indicates a minimum night requirement of 100 nights.
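Pulling the extreme rows out for inspection is a one-liner with `nlargest`. Below is a minimal sketch on a toy frame standing in for `outliersdf` (the values echo the Queens listing described above; the other rows are illustrative):

```python
import pandas as pd

# Toy outlier frame; the first row mirrors the Queens private room,
# the others are illustrative.
outliers = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [10000, 5000, 2500],
    "minimum_nights": [100, 2, 3],
})

# Pull the single most expensive listing to inspect its other fields,
# e.g. the 100-night minimum that suggests a long-term rental.
top = outliers.nlargest(1, "price")
print(top)
```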
#learning about outliers according to neighborhood and room type
sns.catplot(x='neighbourhood_group', y='price', data = outliersdf, hue = 'room_type')
plt.title(" Catplot Price Distribution of Outliers According to Room Type & Neighborhood")
plt.xlabel('Neighborhood Group')
plt.ylabel('Price')
Graphs in relation to minimum nights
#sns.boxplot(x='minimum_nights', data=df)
px.histogram(tidydf, x = 'minimum_nights', title = 'Minimum Nights Distribution')
np.mean(tidydf.minimum_nights)
Most listings in 2019 NYC have a minimum night requirement between 1 and 7 nights, with an average requirement of 6.64 nights and 2 nights being the most common. There is an apparent spike of listings with a 30-night minimum.
There seem to be multiple outliers, with one as extreme as 1250 minimum nights. This wide range of minimum night requirements suggests the existence of not just vacation rental properties but potentially also long-term residences.
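The spike at 30 nights can be confirmed by tabulating the most common requirements. Below is a minimal sketch with illustrative values; the notebook would use `tidydf['minimum_nights']`:

```python
import pandas as pd

# Illustrative minimum-night requirements, including a 30-night cluster
# and one extreme outlier.
min_nights = pd.Series([1, 2, 2, 2, 3, 7, 30, 30, 30, 30, 1250])

# Tabulate the most common requirements; a spike at 30 nights points to
# month-long (long-term) rentals.
counts = min_nights.value_counts()
print(counts.head(3))
```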
We will determine their outliers again using the 1.5 IQR test.
#Using the 1.5 Interquartile Range Test to determine outliers
IQR_mn = np.percentile(tidydf.minimum_nights, 75) - np.percentile(tidydf.minimum_nights, 25)
lower_limit_mn = np.percentile(tidydf.minimum_nights, 25) - 1.5*IQR_mn
upper_limit_mn = np.percentile(tidydf.minimum_nights, 75) + 1.5*IQR_mn
print(lower_limit_mn, upper_limit_mn)
# Calculations for outliers within the minimum_nights column, used to deem what counts as an extreme minimum-night requirement.
#Since there technically cannot be any negative values for min nights, we will fix the lower limit at 0.
outliersdf_mn = tidydf.loc[lambda tidydf: tidydf.minimum_nights > upper_limit_mn]
outliersdf_mn.shape
tidydf_mn = tidydf.loc[lambda tidydf: tidydf.minimum_nights <= upper_limit_mn]
tidydf_mn.shape
sns.catplot(x='neighbourhood_group', y='minimum_nights', data = tidydf_mn, hue='room_type')
plt.title('Distribution of Listings with Non-Extreme Minimum Night Values According to Borough and Room Type')
plt.xlabel('Borough')
plt.ylabel('Minimum Nights')
From the graph, it seems that Staten Island and the Bronx are not popular boroughs for long-term rentals longer than 30 days. Further investigation of the few orange anomalies in the Bronx reveals that they are all from the same host, Sasha (Host ID: 2988712), who seems to have multiple 90-day listings around Claremont Village and Mount Hope in the Bronx.
Most longer term rentals in Brooklyn, Manhattan, and Queens are of entire properties. Interesting anomalies include the 90 day minimum night requirement listings for a shared room in Manhattan. Further investigation shows that these listings (Host IDs: 21628183 and 23184420) are both from LaGuardia Houses Public Housing Development looking for long term housemates. We believe that these listings are a result of hosts trying to bypass New York's legal barrier of subletting public housing for additional income, adding onto the high accessibility of AirBnB for anyone to put out a listing.
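Clusters like these are easiest to inspect by filtering on the host. Below is a minimal sketch on a toy frame (host_id 2988712 is the Bronx host identified above; the other rows and values are illustrative):

```python
import pandas as pd

# Toy frame; host_id 2988712 is the Bronx host discussed above,
# the remaining row is illustrative.
listings = pd.DataFrame({
    "host_id": [2988712, 2988712, 555],
    "neighbourhood_group": ["Bronx", "Bronx", "Brooklyn"],
    "minimum_nights": [90, 90, 2],
})

# Isolate one host's listings to inspect a cluster of anomalies together.
bronx_host = listings.loc[listings.host_id == 2988712]
print(bronx_host)
```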
sns.catplot(x='neighbourhood_group', y='minimum_nights', data = outliersdf_mn, hue='room_type')
plt.title('Distribution of Listings with Outlying Minimum Night Values According to Borough and Room Type')
plt.xlabel('Borough')
plt.ylabel('Minimum Nights')
As expected, most extreme values of minimum night requirement are from listings in Manhattan and Brooklyn, with Queens and the Bronx each only having 1 listing and Staten Island not having any. Both listings in Queens and the Bronx seem to be long term rental leasings, just like most of the other leasings in this plot.
Graph by Reviews Per Month
px.histogram(tidydf, x = 'reviews_per_month', title = 'Reviews Per Month Distribution')
np.mean(tidydf.reviews_per_month)
The number of reviews per month is heavily right-skewed, with listings receiving an average of 1.83 reviews per month. There are obvious outliers, with one listing receiving as many as 58.5 reviews per month on average.
sns.catplot(x='neighbourhood_group', y='reviews_per_month', data = tidydf, hue='room_type')
plt.title('Distribution of Monthly Reviews by Borough and Room Type')
plt.xlabel('Borough')
plt.ylabel('Reviews per Month')
We can see that most boroughs have a similar pattern of reviews per month, with 2 anomalies in Manhattan. Further analysis reveals that both listings (ID: 32678719 and 32678720) are by the same host Row NYC (Host ID: 244361589) which is a hotel in New York City. They may have more reviews as they may be more attractive to customers as an established hotel chain. Additionally, they may have multiple rooms available under the same listing, which would explain why they have listings getting 58.5 reviews in a span of 30 or 31 days.
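Whether the top outliers share a host can be checked directly. Below is a minimal sketch on a toy frame (the first two rows echo the Row NYC listings and host id cited above; the others are illustrative):

```python
import pandas as pd

# Toy frame; the first two rows mirror the Row NYC pair, the rest are
# illustrative.
listings = pd.DataFrame({
    "id": [32678719, 32678720, 3, 4],
    "host_id": [244361589, 244361589, 7, 8],
    "reviews_per_month": [58.5, 57.0, 3.1, 1.2],
})

# Take the two listings with the most monthly reviews and count how many
# distinct hosts they belong to.
top2 = listings.nlargest(2, "reviews_per_month")
shared_hosts = top2["host_id"].nunique()
print(shared_hosts)
```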
Graphs by Availabile Days out of 365 Days
px.histogram(df, x = 'availability_365', title = 'Distribution of Available Days out of 365 days')
We can see a bimodal distribution with spikes at both ends. On one extreme, many NYC hosts were not leasing out their place during 2019, or perhaps only for a week or less; on the other, some hosts had listings available for almost the entire year.
sns.catplot(x='neighbourhood_group', y='availability_365', data = tidydf, hue='room_type')
plt.title('Distribution of Listings by Borough and Days of Availability')
plt.xlabel('Borough')
plt.ylabel('Availability 365')
All boroughs had availabilities ranging from 0 to 365, with little differentiation between room types. It is interesting to note that a majority of Staten Island listings were available for more days of the year compared to other boroughs.
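The Staten Island observation can be checked by comparing median availability per borough. Below is a minimal sketch on a toy frame with illustrative values; the notebook would group `tidydf` instead:

```python
import pandas as pd

# Toy frame with illustrative availability values.
listings = pd.DataFrame({
    "neighbourhood_group": ["Staten Island", "Staten Island",
                            "Brooklyn", "Brooklyn"],
    "availability_365": [340, 300, 120, 60],
})

# Median available days per borough; on the real data Staten Island's
# median sits noticeably above the other boroughs'.
median_avail = listings.groupby("neighbourhood_group")["availability_365"].median()
print(median_avail)
```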
Use NLP libraries to clean the names column
Find the relationship between popularity (reviews per month) and price versus the most frequent words that appear in names
Find the relationship between names which contain unique characters, such as emojis and East Asian symbols, and their locations
We will be exploring the name feature of the Airbnb dataset in depth using NLP techniques. With the popular NLP library spaCy, we are able to clean the listing names and extract relevant information from them, such as the most frequently used words as well as non-English symbols used in listings.
In our analysis of the listing names, we provide visualizations showing the relationship of the names across price subgroups and popularity subgroups, where popularity is measured by reviews per month.
Additionally, we explore the unique listing names which contain non-English symbols and plot their locations to see if there is any correlation between the language used in a listing and its location. Such a correlation may be present for listing names containing Chinese characters that are potentially located in Chinatown.
#Split Data based on number of reviews per month
df_r_low = df[df['reviews_per_month'].between(0, 0.45)] #25%
df_r_mid = df[df['reviews_per_month'].between(0.46, 1.23)] #50%
df_r_high = df[df['reviews_per_month'].between(1.24, 2.68)] #75%
df_r_extra = df[df['reviews_per_month'].between(2.69, 70)] #100%
#Split Data based on price quantiles
df_p_low = df[df['price'].between(0, 70)] #25%
df_p_mid = df[df['price'].between(70.01, 109)] #50%
df_p_high = df[df['price'].between(109.01, 175)] #75%
df_p_extra = df[df['price'].between(175, 9999)] #100%
quartiles_r = df[["reviews_per_month"]].describe()
quartiles_r
quartiles_p = df[["price"]].describe()
quartiles_p
#Word cloud of names directly from the dataframe df
#includes names with punctuation and stopwords
df_nc = pd.read_csv('Airbnb_NYC_2019.csv')
from wordcloud import WordCloud
corpus = list(df_nc["name"].values)
clean_named = (' '.join(w for w in corpus if isinstance(w, str) ))
wordcloud = WordCloud(width=1200, height=600,
                      background_color='white',
                      min_font_size=10).generate(clean_named)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.savefig("wordcloud.jpg")
plt.show()
An initial analysis of our raw dataframe's listing names shows that the most frequent words are neighborhood and borough names, room types, and general location indicators. It makes sense that Brooklyn and Manhattan are among the most frequent words, as they are the two most popular boroughs as seen in our EDA. We can also infer that hosts use listing names to capitalize on different features of the property, from location indicators such as "heart" and "near" to the number of bedrooms and various landmark and neighborhood names. Additionally, it is interesting to note the many variations of phrases containing "cozy", such as "cozy room" and "cozy bedroom".
df.columns
Why are we using the cleaned dataset for names?
Reason: for the most part, it removes irrelevant listings which potentially were not active in 2019. For the focus of this report, we ideally aim to see listings active in 2019 and compare the relationship between subgroups of price and popularity via reviews per month.
Reason: we only use the cleaned dataset because we want to look only at names representing active Airbnb listings in New York during 2019. The cleaned data removes all invalid entries.
df.shape
#spacy used to clean names columns of punctuations and stop words
from spacy.lang.en import English
import spacy
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
#Reviews per month comparison of word frequencies
from collections import Counter
dfs = [df_r_low, df_r_mid, df_r_high, df_r_extra] #list of dataframes
names_c = [] #list to store all the names per subgroup as one big string, should be list length of 4
nlp = English() #spacy tokenizer
nlp.Defaults.stop_words |= {"room", "bedroom", "apartment", "apt"} #added stop words
for i in dfs:
    names_low = i["name"].values
    clean_named = ' '.join(str(n).lower() for n in names_low)
    name = nlp(clean_named)  # tokenized names
    filtered_name = []
    # filter out stop words and punctuation
    for word in name:
        if not word.is_stop and not word.is_punct:
            filtered_name.append(word)
    name_clean = ' '.join(str(w).lower() for w in filtered_name)
    names_c.append(name_clean)
def plot_names(names):
    split_it = names.split()
    counts = Counter(split_it)
    # most_common(k) returns the k most frequently encountered words
    most_occur = counts.most_common(20)
    most_df = pd.DataFrame(most_occur, columns=['Words', 'Freq'])
    plt.figure()
    sns.barplot(y="Words", x="Freq", data=most_df)
namesr = ["Low review rate", "Mid review rate", "High review rate", "Extra review rate"]
#CREATES THE PLOTS
for i in range(len(names_c)):
    plot_names(names_c[i])
    plt.title(namesr[i] + ' popular word counts')
    plt.tight_layout()
    plt.savefig("fig_review" + str(i) + ".jpg")
#Price comparison of word frequencies
dfsp = [df_p_low, df_p_mid, df_p_high, df_p_extra]
names_cp = [] #list of all the names as strings
nlp = English()#spacy tokenizer
for i in dfsp:
    names_low = i["name"].values
    clean_named = ' '.join(str(n).lower() for n in names_low)
    name = nlp(clean_named)  # tokenized names
    filtered_name = []
    # filter out stop words and punctuation
    for word in name:
        if not word.is_stop and not word.is_punct:
            filtered_name.append(word)
    name_clean = ' '.join(str(w).lower() for w in filtered_name)
    names_cp.append(name_clean)
def plot_names(names):
    split_it = names.split()
    counts = Counter(split_it)
    # most_common(k) returns the k most frequently encountered words
    most_occur = counts.most_common(20)
    most_df = pd.DataFrame(most_occur, columns=['Words', 'Freq'])
    plt.figure()
    sns.barplot(y="Words", x="Freq", data=most_df)
namesp = ["Low price group", "Mid price group", "High price group", "Extra price group"]
#CREATES THE PLOTS
for i in range(len(names_cp)):
    plot_names(names_cp[i])
    plt.title(namesp[i] + ' popular word counts')
    plt.tight_layout()
    plt.savefig("fig_price" + str(i) + ".jpg")
from matplotlib.pyplot import figure
#Price stacked word counts
lowp = names_cp[0].split()
countlowp = Counter(lowp)
countlowp50 = countlowp.most_common(50)
lowpdf = pd.DataFrame(countlowp50, columns =['Words', 'Freq_low'])
midp = names_cp[1].split()
countmidp = Counter(midp)
countmidp50 = countmidp.most_common(50)
midpdf = pd.DataFrame(countmidp50, columns =['Words', 'Freq_mid'])
highp = names_cp[2].split()
counthighp = Counter(highp)
counthighp50 = counthighp.most_common(50)
highpdf = pd.DataFrame(counthighp50, columns =['Words', 'Freq_high'])
extp = names_cp[3].split()
countextp = Counter(extp)
countextp50 = countextp.most_common(50)
extpdf = pd.DataFrame(countextp50, columns =['Words', 'Freq_ext'])
asd1 = pd.merge(lowpdf,midpdf,on='Words',how='outer')
asd2 = pd.merge(asd1,highpdf,on='Words',how='outer')
asd3 = pd.merge(asd2,extpdf,on='Words',how='outer')
finaldf = asd3.fillna(0)
freqsum = finaldf['Freq_low']+finaldf['Freq_mid']+finaldf['Freq_high']+finaldf['Freq_ext']
finaldf['freq_sum'] = freqsum
finaldf.sort_values(by=['freq_sum'], ascending=False, inplace=True)
finaldf = finaldf.head(20)
finaldf = finaldf.set_index('Words')
finaldf = finaldf.drop(['freq_sum'], axis=1)
finaldf.plot(kind='barh', stacked=True)
plt.title("Price Groups Stacked Popular Word Counts")
plt.tight_layout()
plt.savefig("pgStack.jpg")
finaldf
#Reviews stacked word counts
lowr = names_c[0].split()
countlowr = Counter(lowr)
countlowr50 = countlowr.most_common(50)
lowpdf = pd.DataFrame(countlowr50, columns =['Words', 'Freq_low'])
midr = names_c[1].split()
countmidr = Counter(midr)
countmidp50 = countmidr.most_common(50)
midpdf = pd.DataFrame(countmidp50, columns =['Words', 'Freq_mid'])
highr = names_c[2].split()
counthighr = Counter(highr)
counthighp50 = counthighr.most_common(50)
highpdf = pd.DataFrame(counthighp50, columns =['Words', 'Freq_high'])
extr = names_c[3].split()
countextr = Counter(extr)
countextp50 = countextr.most_common(50)
extpdf = pd.DataFrame(countextp50, columns =['Words', 'Freq_ext'])
asd1 = pd.merge(lowpdf,midpdf,on='Words',how='outer')
asd2 = pd.merge(asd1,highpdf,on='Words',how='outer')
asd3 = pd.merge(asd2,extpdf,on='Words',how='outer')
finaldf = asd3.fillna(0)
freqsum = finaldf['Freq_low']+finaldf['Freq_mid']+finaldf['Freq_high']+finaldf['Freq_ext']
finaldf['freq_sum'] = freqsum
finaldf.sort_values(by=['freq_sum'], ascending=False, inplace=True)
finaldf = finaldf.head(20)
finaldf = finaldf.set_index('Words')
finaldf = finaldf.drop(['freq_sum'], axis=1)
finaldf.plot(kind='barh', stacked=True)
plt.title("Reviews Groups Stacked Popular Word Counts")
plt.tight_layout()
plt.savefig("rgStack.jpg")
finaldf
import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize
import string
cleaned_names = []
for name in list(df["name"].values):
    text_tokens = word_tokenize(str(name).lower())
    clean_toks = [word for word in text_tokens
                  if word not in nlp.Defaults.stop_words and word not in string.punctuation]
    cleaned_names.append(' '.join(str(t) for t in clean_toks))
cleaned_names
#Finding non alphabet characters
specialC = []
#Check whether each name can be encoded as ASCII; if it cannot, it contains special characters
for i in list(df["name"].values):
    a = str(i)
    if not a.isascii():
        specialC.append(i)
len(specialC)
specialC
import re
eastasian = []
def isEA(s):
    s = str(s)
    # True if the string contains any CJK Unified Ideographs (U+4E00 - U+9FFF)
    if len(re.findall(r'[\u4e00-\u9fff]+', s)) > 0:
        return True
    return False

for i in specialC:
    a = str(i)
    if isEA(a):
        eastasian.append(i)
len(eastasian)
# sample = '法拉盛中心温馨两房两厅公寓。近一切。2 br close to everything'
# ("Cozy two-bedroom, two-living-room apartment in central Flushing. Close to everything.")
# re.findall(r'[\u4e00-\u9fff]+', sample)
#japanese listings: 'NYで人気の街ブルックリンパークスロープで、暮らしてみませんか。' ("Why not try living in Park Slope, Brooklyn, one of NY's popular neighborhoods?"),
#'NYミッドタウン高級コンドのリビングルームに宿泊' ("Stay in the living room of a luxury Midtown NY condo")
from emoji import UNICODE_EMOJI
emojiLists = []
def is_emoji(s):
    s = str(s)
    count = 0
    # Count occurrences of any emoji codepoint provided by the emoji package
    for emoji in UNICODE_EMOJI:
        count += s.count(emoji)
    return count > 0  # True if the string contains at least one emoji

for i in specialC:
    a = str(i)
    if is_emoji(a):
        emojiLists.append(i)
len(emojiLists)
eastasian_index = []
emoji_index = []
#Create boolean index lists by checking whether each name contains an emoji or East Asian characters
for i in list(df["name"].values):
    a = str(i)
    if is_emoji(a) and not a.isascii():  # must contain an emoji
        emoji_index.append(True)
    else:
        emoji_index.append(False)

for i in list(df["name"].values):
    a = str(i)
    if isEA(a) and not a.isascii():  # must contain East Asian symbols
        eastasian_index.append(True)
    else:
        eastasian_index.append(False)
df_eastasian = df[eastasian_index]
df_emoji = df[emoji_index]
Density Map of Listings with Emojis
plt.figure(figsize=(15,30))
sns_map_emoji = sns.scatterplot(x='longitude', y='latitude',s=20, data=df_emoji)
ctx.add_basemap(sns_map_emoji, crs = 'EPSG:4326', source=ctx.providers.CartoDB.Positron)
sns_map_emoji.set_axis_off()
plt.title('Density Map of Listings with Emojis')
sns.countplot(y = 'neighbourhood_group', data= df_emoji).set_title("Listings by Borough for Emojis")
sns.countplot(y = 'neighbourhood', data= df_emoji, order = df_emoji.neighbourhood.value_counts().iloc[:10].index).set_title("Listings by Neighborhood for Emojis")
Density Map for Non-English Listings
plt.figure(figsize=(15,30))
sns_map_noneng = sns.scatterplot(x='longitude', y='latitude',s=20, data=df_eastasian)
ctx.add_basemap(sns_map_noneng, crs = 'EPSG:4326', source=ctx.providers.CartoDB.Positron)
sns_map_noneng.set_axis_off()
plt.title('Density Map of East Asian Language Listings')
sns.countplot(y = 'neighbourhood_group', data= df_eastasian).set_title("Listings by Borough for East Asian Listings")
df_emoji[df_emoji.neighbourhood == "Hell's Kitchen"]
sns.countplot(y = 'neighbourhood', data= df_eastasian, order = df_eastasian.neighbourhood.value_counts().iloc[:10].index).set_title("Listings by Neighborhood for East Asian Listings")
df_eastasian
print(len(df_eastasian))
df_eastasian[df_eastasian.neighbourhood == "Forest Hills"]
sns.countplot(y = 'host_name', data= df_eastasian, order=df_eastasian.host_name.value_counts().iloc[:10].index).set_title("Dist. of Host Owners for Non-English Listings")
px.histogram(tidydf, x = 'price', title = 'Total Price Distribution')
px.histogram(df_eastasian, x = 'price', title = 'East Asian Price Distribution')
px.histogram(df_emoji, x = 'price', title = 'Emoji Price Distribution')
Dataset
EDA
NLP